{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Nested cross-validation\n",
    "\n",
    "Cross-validation can be used both for hyperparameter tuning and for estimating\n",
    "the generalization performance of a model. However, using it for both purposes\n",
    "at the same time is problematic, as the resulting evaluation can underestimate\n",
    "some overfitting that results from the hyperparameter tuning procedure itself.\n",
    "\n",
    "Philosophically, hyperparameter tuning is a form of machine learning itself\n",
    "and therefore, we need another outer loop of cross-validation to properly\n",
    "evaluate the generalization performance of the full modeling procedure.\n",
    "\n",
    "This notebook highlights nested cross-validation and its impact on the\n",
    "estimated generalization performance compared to naively using a single level\n",
    "of cross-validation, both for hyperparameter tuning and evaluation of the\n",
    "generalization performance.\n",
    "\n",
    "We will illustrate this difference using the breast cancer dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_breast_cancer\n",
    "\n",
    "data, target = load_breast_cancer(return_X_y=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we use `GridSearchCV` to find the best parameters via cross-validation\n",
    "on a minimal parameter grid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "param_grid = {\"C\": [0.1, 1, 10], \"gamma\": [0.01, 0.1]}\n",
    "model_to_tune = SVC()\n",
    "\n",
    "search = GridSearchCV(estimator=model_to_tune, param_grid=param_grid, n_jobs=2)\n",
    "search.fit(data, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We recall that, internally, `GridSearchCV` trains several models for each on\n",
    "sub-sampled training sets and evaluate each of them on the matching testing\n",
    "sets using cross-validation. This evaluation procedure is controlled via using\n",
    "the `cv` parameter. The procedure is then repeated for all possible\n",
    "combinations of parameters given in `param_grid`.\n",
    "\n",
    "The attribute `best_params_` gives us the best set of parameters that maximize\n",
    "the mean score on the internal test sets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The best parameters found are: {search.best_params_}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also show the mean score obtained by using the parameters `best_params_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The mean CV score of the best model is: {search.best_score_:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At this stage, one should be extremely careful using this score. The\n",
    "misinterpretation would be the following: since this mean score was computed\n",
    "using cross-validation test sets, we could use it to assess the generalization\n",
    "performance of the model trained with the best hyper-parameters.\n",
    "\n",
    "However, we should not forget that we used this score to pick-up the best\n",
    "model. It means that we used knowledge from the test sets (i.e. test scores)\n",
    "to select the hyper-parameter of the model it-self.\n",
    "\n",
    "Thus, this mean score is not a fair estimate of our testing error. Indeed, it\n",
    "can be too optimistic, in particular when running a parameter search on a\n",
    "large grid with many hyper-parameters and many possible values per\n",
    "hyper-parameter. A way to avoid this pitfall is to use a \"nested\"\n",
    "cross-validation.\n",
    "\n",
    "In the following, we will use an inner cross-validation corresponding to the\n",
    "previous procedure above to only optimize the hyperparameters. We will also\n",
    "embed this tuning procedure within an outer cross-validation, which is\n",
    "dedicated to estimate the testing error of our tuned model.\n",
    "\n",
    "In this case, our inner cross-validation always gets the training set of the\n",
    "outer cross-validation, making it possible to always compute the final testing\n",
    "scores on completely independent sets of samples.\n",
    "\n",
    "Let us do this in one go as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, KFold\n",
    "\n",
    "# Declare the inner and outer cross-validation strategies\n",
    "inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)\n",
    "outer_cv = KFold(n_splits=3, shuffle=True, random_state=0)\n",
    "\n",
    "# Inner cross-validation for parameter search\n",
    "model = GridSearchCV(\n",
    "    estimator=model_to_tune, param_grid=param_grid, cv=inner_cv, n_jobs=2\n",
    ")\n",
    "\n",
    "# Outer cross-validation to compute the testing score\n",
    "test_score = cross_val_score(model, data, target, cv=outer_cv, n_jobs=2)\n",
    "print(\n",
    "    \"The mean score using nested cross-validation is: \"\n",
    "    f\"{test_score.mean():.3f} \u00b1 {test_score.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The reported score is more trustworthy and should be close to production's\n",
    "expected generalization performance. Note that in this case, the two score\n",
    "values are very close for this first trial.\n",
    "\n",
    "We would like to better assess the difference between the nested and\n",
    "non-nested cross-validation scores to show that the latter can be too\n",
    "optimistic in practice. To do this, we repeat the experiment several times and\n",
    "shuffle the data differently to ensure that our conclusion does not depend on\n",
    "a particular resampling of the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_score_not_nested = []\n",
    "test_score_nested = []\n",
    "\n",
    "N_TRIALS = 20\n",
    "for i in range(N_TRIALS):\n",
    "    # For each trial, we use cross-validation splits on independently\n",
    "    # randomly shuffled data by passing distinct values to the random_state\n",
    "    # parameter.\n",
    "    inner_cv = KFold(n_splits=5, shuffle=True, random_state=i)\n",
    "    outer_cv = KFold(n_splits=3, shuffle=True, random_state=i)\n",
    "\n",
    "    # Non_nested parameter search and scoring\n",
    "    model = GridSearchCV(\n",
    "        estimator=model_to_tune, param_grid=param_grid, cv=inner_cv, n_jobs=2\n",
    "    )\n",
    "    model.fit(data, target)\n",
    "    test_score_not_nested.append(model.best_score_)\n",
    "\n",
    "    # Nested CV with parameter optimization\n",
    "    test_score = cross_val_score(model, data, target, cv=outer_cv, n_jobs=2)\n",
    "    test_score_nested.append(test_score.mean())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can merge the data together and make a box plot of the two strategies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "all_scores = {\n",
    "    \"Not nested CV\": test_score_not_nested,\n",
    "    \"Nested CV\": test_score_nested,\n",
    "}\n",
    "all_scores = pd.DataFrame(all_scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
    "all_scores.plot.box(color=color, vert=False)\n",
    "plt.xlabel(\"Accuracy\")\n",
    "_ = plt.title(\n",
    "    \"Comparison of mean accuracy obtained on the test sets with\\n\"\n",
    "    \"and without nested cross-validation\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that the generalization performance estimated without using nested\n",
    "CV is higher than what we obtain with nested CV. The reason is that the tuning\n",
    "procedure itself selects the model with the highest inner CV score. If there\n",
    "are many hyper-parameter combinations and if the inner CV scores have\n",
    "comparatively large standard deviations, taking the maximum value can lure the\n",
    "naive data scientist into over-estimating the true generalization performance\n",
    "of the result of the full learning procedure. By using an outer\n",
    "cross-validation procedure one gets a more trustworthy estimate of the\n",
    "generalization performance of the full learning procedure, including the\n",
    "effect of tuning the hyperparameters.\n",
    "\n",
    "As a conclusion, when optimizing parts of the machine learning pipeline (e.g.\n",
    "hyperparameter, transform, etc.), one needs to use nested cross-validation to\n",
    "evaluate the generalization performance of the predictive model. Otherwise,\n",
    "the results obtained without nested cross-validation are often overly\n",
    "optimistic."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}